Feature Based Classification of Protein Docking Sites: An Algorithm for Large Databases and Experimental Results

Authors

  • Martin Ester
  • Hans-Peter Kriegel
  • Stefan Wirth
Abstract

(SA_6 = very-low) or strongly hydrophobic (HY_PHOB = very-high) segments of the surface are likely to be docking sites. On the other hand, if a surface segment is very concave (SA_6 = very-high), or planar and not strongly hydrophobic (SA_6 = medium ∧ HY_PHOB = low), or convex and not strongly hydrophobic (SA_6 = low ∧ HY_PHOB = low), then it is usually not a docking site. Thus, different types of docking and non-docking sites were discovered. The classification algorithm always chose SA_6 as the first attribute and typically chose HY_PHOB as the second one. We conclude that SA_6 and HY_PHOB carry the most information (w.r.t. the features investigated) for the classification of docking sites.

Conclusions

We have developed an efficient classification algorithm for large databases. The algorithm was used to classify protein docking sites based on a set of geometric and physico-chemical surface features. Although only a small set of features was investigated, our experiments yielded some interesting and useful classification rules. More features should be considered to increase the confidence of the rules. The following features seem potentially useful: carbon, oxygen, and sulphur content, hydrogen-ion donor or acceptor property, and amino acid content in a defined neighborhood of a surface point.

Some major improvements of ID3 were introduced [Wir 95] for application to large databases:
• ID3 discovers only rules with a confidence of 1.0. However, such rules can hardly be extracted from biological databases, and rules with lower confidence may be quite interesting. Therefore, our algorithm searches for rules that meet a minimum confidence specified by the user.
• In ID3, the condition of a rule is a conjunction of atomic conditions. We allow disjunctions of "neighbouring" attribute values in a condition in order to increase the support of a rule if needed. The user defines the minimum support required for all rules.
• ID3 explores the decision tree in a depth-first manner, resulting in a large number of passes through the database. Since these passes are expensive for large databases, our algorithm uses a breadth-first search, minimizing the number of database passes.

Experiments

We have performed classification experiments using the BIOWEPRO protein database [KSS 95]. Feature values have been clustered into five intervals to yield symbolic attribute values. To obtain a training set, 22 representative proteins with known docking sites were selected. For each surface point of these proteins, an additional attribute had …
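The minimum-confidence and minimum-support thresholds listed among the ID3 improvements above can be illustrated with a short sketch. This is not the authors' implementation: the function name `rule_stats`, the class attribute `DOCK`, and the toy records are hypothetical, and support is assumed here to mean the fraction of records that satisfy the rule's condition. A condition maps each attribute to a set of allowed values, so a disjunction of neighbouring values is simply a set with more than one element.

```python
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]

# Toy surface points (invented for illustration, not from BIOWEPRO).
records: List[Record] = [
    {"SA_6": "very-low", "HY_PHOB": "low", "DOCK": "yes"},
    {"SA_6": "very-low", "HY_PHOB": "medium", "DOCK": "yes"},
    {"SA_6": "low", "HY_PHOB": "low", "DOCK": "no"},
    {"SA_6": "low", "HY_PHOB": "very-high", "DOCK": "yes"},
    {"SA_6": "medium", "HY_PHOB": "low", "DOCK": "no"},
    {"SA_6": "very-high", "HY_PHOB": "low", "DOCK": "no"},
    {"SA_6": "very-low", "HY_PHOB": "high", "DOCK": "no"},
    {"SA_6": "medium", "HY_PHOB": "high", "DOCK": "yes"},
]


def rule_stats(
    data: List[Record],
    condition: Dict[str, Set[str]],
    target: str,
    target_value: str,
) -> Tuple[float, float]:
    """Return (support, confidence) of `condition -> target = target_value`.

    Assumption: support is the share of all records matching the condition;
    confidence is the share of those matching records with the target class.
    """
    matching = [
        r for r in data
        if all(r[attr] in allowed for attr, allowed in condition.items())
    ]
    if not data or not matching:
        return 0.0, 0.0
    hits = sum(1 for r in matching if r[target] == target_value)
    return len(matching) / len(data), hits / len(matching)


# Atomic condition: SA_6 = very-low
narrow = rule_stats(records, {"SA_6": {"very-low"}}, "DOCK", "yes")
# Disjunction of neighbouring values: SA_6 in {very-low, low}
wide = rule_stats(records, {"SA_6": {"very-low", "low"}}, "DOCK", "yes")
```

On this toy data, widening the condition to two neighbouring values raises the support while lowering the confidence, which is the trade-off the user-specified thresholds are meant to control.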
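The breadth-first idea mentioned in the list of ID3 improvements can be sketched as one database pass per tree level: while scanning the records once, class counts are collected for every open node and every candidate attribute at the same time. The sketch below is an assumption about how such a pass might look, not the algorithm from [Wir 95]; the names `count_level`, `open_nodes`, and `DOCK`, as well as the toy records, are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Record = Dict[str, str]
Key = Tuple[int, str, str]  # (node index, attribute, attribute value)


def count_level(
    data: List[Record],
    open_nodes: List[Dict[str, str]],
    attributes: List[str],
    target: str,
) -> Dict[Key, Counter]:
    """Collect class histograms for all open nodes of one tree level.

    Each open node is described by the attribute/value tests on its path.
    The records are scanned exactly once for the whole level.
    """
    counts: Dict[Key, Counter] = {}
    for rec in data:  # the single pass over the database
        for idx, node in enumerate(open_nodes):
            if all(rec[a] == v for a, v in node.items()):
                for attr in attributes:
                    if attr not in node:  # only attributes not yet tested
                        key = (idx, attr, rec[attr])
                        counts.setdefault(key, Counter())[rec[target]] += 1
    return counts


# Toy data (invented): two open nodes after a first split on SA_6.
toy = [
    {"SA_6": "very-low", "HY_PHOB": "low", "DOCK": "yes"},
    {"SA_6": "very-low", "HY_PHOB": "high", "DOCK": "no"},
    {"SA_6": "low", "HY_PHOB": "low", "DOCK": "no"},
    {"SA_6": "low", "HY_PHOB": "very-high", "DOCK": "yes"},
]
level = count_level(
    toy,
    [{"SA_6": "very-low"}, {"SA_6": "low"}],
    ["SA_6", "HY_PHOB"],
    "DOCK",
)
```

A depth-first builder would instead rescan the data for each node it expands, so the number of passes grows with the number of nodes rather than with the depth of the tree.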


Similar resources

Feature selection using genetic algorithm for classification of schizophrenia using fMRI data

In this paper we propose a new method for classification of subjects into schizophrenia and control groups using functional magnetic resonance imaging (fMRI) data. In the preprocessing step, the number of fMRI time points is reduced using principal component analysis (PCA). Then, independent component analysis (ICA) is used for further data analysis. It estimates independent components (ICs) of...

Modeling and design of a diagnostic and screening algorithm based on hybrid feature selection-enabled linear support vector machine classification

Background: In the current study, a hybrid feature selection approach involving filter and wrapper methods is applied to several bioscience databases with various records, attributes, and classes; this strategy combines the advantages of both methods, such as fast execution, generality, and accuracy. The purpose is to diagnose disease status and estimate patient survival. Method...

Fast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets

Biomedical datasets usually include a large number of features relative to the number of samples. However, some data dimensions may be less relevant or even irrelevant to the output class. Selection of an optimal subset of features is critical, not only to reduce the processing cost but also to improve the classification results. To this end, this paper presents a hybrid method of filter and wr...

An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification

The Internet provides easy access to many kinds of library resources. However, classifying documents within a large amount of data is still an issue, and finding particular documents demands time and energy. Grouping similar documents into specific classes can reduce the time needed to search for the required data, particularly text documents. This is further facilitated by using Artificial...

Identification of RNA-binding sites in artemin based on docking energy landscapes and molecular dynamics simulation

There are questions concerning the functions of artemin, an abundant stress protein found in Artemia during embryo development. It has been reported that artemin binds RNA at high temperatures in vitro, suggesting an RNA protective role. In this study, we investigated the possibility of the presence of RNA-binding sites and their structural properties in artemin, using docking energy ...


Publication date: 1996